{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Binary Classification\n",
    "\n",
    "**This notebook follows exactly the tutorial from Kaggle: [data analysis framework from Kaggle](https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy). Thanks to the author [LD Freeman](https://www.kaggle.com/ldfreeman3) for creating such a great tutorial! My goal is to create an Apache Spark version using the same framework.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create Spark entry points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from pyspark import SparkConf, SparkContext\n",
    "from pyspark.sql import SparkSession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sc = SparkContext(conf=SparkConf())\n",
    "spark = SparkSession(sparkContext=sc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 1: define the problem\n",
    "\n",
    "** What sorts of people were likely to survive from the Titanic accident?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 2: gather the data\n",
    "\n",
    "The datasets can be found here: [https://www.kaggle.com/c/titanic/data](https://www.kaggle.com/c/titanic/data).\n",
    "It is also available in this github repository:\n",
    "    \n",
    "* [data/titanic/gender_submission.csv](data/titanic/gender_submission.csv)\n",
    "* [data/titanic/test.csv](data/titanic/test.csv)\n",
    "* [data/titanic/train.csv](data/titanic/train.csv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 3: prepare data for consumption"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1 Import libraries\n",
    "### 3.11 Import python packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Python version: 3.6.4 |Anaconda custom (64-bit)| (default, Dec 21 2017, 21:42:08) \n",
      "[GCC 7.2.0]\n",
      "Python version: 0.20.3\n",
      "matplotlib version: 2.1.0\n",
      "numpy version: 1.13.3\n",
      "scipy version: 1.0.0\n",
      "IPython version: 6.1.0\n",
      "Apache Spark Pyspark version: 2.2.1\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "# load packages\n",
    "import sys\n",
    "print('Python version: {}'. format(sys.version))\n",
    "\n",
    "import pandas as pd\n",
    "print('Python version: {}'. format(pd.__version__))\n",
    "\n",
    "import matplotlib\n",
    "print('matplotlib version: {}'. format(matplotlib.__version__))\n",
    "\n",
    "import numpy as np\n",
    "print('numpy version: {}'. format(np.__version__))\n",
    "\n",
    "import scipy as sp\n",
    "print('scipy version: {}'. format(sp.__version__))\n",
    "\n",
    "import IPython\n",
    "from IPython import display # pretty printing of dataframe in Jupyter notebook\n",
    "print('IPython version: {}'. format(IPython.__version__))\n",
    "\n",
    "import pyspark\n",
    "print('Apache Spark Pyspark version: {}'. format(pyspark.__version__)) # pyspark version\n",
    "\n",
    "# misc libraries\n",
    "import random\n",
    "import time\n",
    "\n",
    "# ignore warnings\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "print('-'*25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.12 Import Pyspark Models for binary classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.classification import LinearSVC\n",
    "from pyspark.ml.classification import LogisticRegression\n",
    "from pyspark.ml.classification import DecisionTreeClassifier\n",
    "from pyspark.ml.classification import GBTClassifier\n",
    "from pyspark.ml.classification import RandomForestClassifier\n",
    "from pyspark.ml.classification import NaiveBayes\n",
    "from pyspark.ml.classification import MultilayerPerceptronClassifier\n",
    "from pyspark.ml.classification import OneVsRest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Meet and greet data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---------- datasets ----------\n",
      "gender_submission.csv\n",
      "test.csv\n",
      "train.csv\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from subprocess import check_output\n",
    "print('-'*10, 'datasets', '-'*10)\n",
    "print(check_output(['ls', 'data/titanic']).decode('utf8'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---------- data types ----------\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>PassengerId</td>\n",
       "      <td>int</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Survived</td>\n",
       "      <td>int</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Pclass</td>\n",
       "      <td>int</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Name</td>\n",
       "      <td>string</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Sex</td>\n",
       "      <td>string</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Age</td>\n",
       "      <td>double</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>SibSp</td>\n",
       "      <td>int</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Parch</td>\n",
       "      <td>int</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Ticket</td>\n",
       "      <td>string</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Fare</td>\n",
       "      <td>double</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Cabin</td>\n",
       "      <td>string</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Embarked</td>\n",
       "      <td>string</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              0       1\n",
       "0   PassengerId     int\n",
       "1      Survived     int\n",
       "2        Pclass     int\n",
       "3          Name  string\n",
       "4           Sex  string\n",
       "5           Age  double\n",
       "6         SibSp     int\n",
       "7         Parch     int\n",
       "8        Ticket  string\n",
       "9          Fare  double\n",
       "10        Cabin  string\n",
       "11     Embarked  string"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# import data\n",
    "# we will split the train data into train and test data in future sections\n",
    "data_raw = spark.read.csv('data/titanic/train.csv', inferSchema=True, header=True)\n",
    "\n",
    "# the test file provided is for validation of final model.\n",
    "data_val = spark.read.csv('data/titanic/test.csv', inferSchema=True, header=True)\n",
    "\n",
    "# preview the data\n",
    "# data type\n",
    "print('-'*10, 'data types', '-'*10)\n",
    "pd.DataFrame(data_raw.dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---------- data summary ----------\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>summary</th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>count</td>\n",
       "      <td>891</td>\n",
       "      <td>891</td>\n",
       "      <td>891</td>\n",
       "      <td>891</td>\n",
       "      <td>891</td>\n",
       "      <td>714</td>\n",
       "      <td>891</td>\n",
       "      <td>891</td>\n",
       "      <td>891</td>\n",
       "      <td>891</td>\n",
       "      <td>204</td>\n",
       "      <td>889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>mean</td>\n",
       "      <td>446.0</td>\n",
       "      <td>0.3838383838383838</td>\n",
       "      <td>2.308641975308642</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>29.69911764705882</td>\n",
       "      <td>0.5230078563411896</td>\n",
       "      <td>0.38159371492704824</td>\n",
       "      <td>260318.54916792738</td>\n",
       "      <td>32.2042079685746</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>stddev</td>\n",
       "      <td>257.3538420152301</td>\n",
       "      <td>0.48659245426485753</td>\n",
       "      <td>0.8360712409770491</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>14.526497332334035</td>\n",
       "      <td>1.1027434322934315</td>\n",
       "      <td>0.8060572211299488</td>\n",
       "      <td>471609.26868834975</td>\n",
       "      <td>49.69342859718089</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>min</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>\"Andersson, Mr. August Edvard (\"\"Wennerstrom\"\")\"</td>\n",
       "      <td>female</td>\n",
       "      <td>0.42</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>110152</td>\n",
       "      <td>0.0</td>\n",
       "      <td>A10</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>max</td>\n",
       "      <td>891</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>van Melkebeke, Mr. Philemon</td>\n",
       "      <td>male</td>\n",
       "      <td>80.0</td>\n",
       "      <td>8</td>\n",
       "      <td>6</td>\n",
       "      <td>WE/P 5735</td>\n",
       "      <td>512.3292</td>\n",
       "      <td>T</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  summary        PassengerId             Survived              Pclass  \\\n",
       "0   count                891                  891                 891   \n",
       "1    mean              446.0   0.3838383838383838   2.308641975308642   \n",
       "2  stddev  257.3538420152301  0.48659245426485753  0.8360712409770491   \n",
       "3     min                  1                    0                   1   \n",
       "4     max                891                    1                   3   \n",
       "\n",
       "                                               Name     Sex  \\\n",
       "0                                               891     891   \n",
       "1                                              None    None   \n",
       "2                                              None    None   \n",
       "3  \"Andersson, Mr. August Edvard (\"\"Wennerstrom\"\")\"  female   \n",
       "4                       van Melkebeke, Mr. Philemon    male   \n",
       "\n",
       "                  Age               SibSp                Parch  \\\n",
       "0                 714                 891                  891   \n",
       "1   29.69911764705882  0.5230078563411896  0.38159371492704824   \n",
       "2  14.526497332334035  1.1027434322934315   0.8060572211299488   \n",
       "3                0.42                   0                    0   \n",
       "4                80.0                   8                    6   \n",
       "\n",
       "               Ticket               Fare Cabin Embarked  \n",
       "0                 891                891   204      889  \n",
       "1  260318.54916792738   32.2042079685746  None     None  \n",
       "2  471609.26868834975  49.69342859718089  None     None  \n",
       "3              110152                0.0   A10        C  \n",
       "4           WE/P 5735           512.3292     T        S  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# data summary\n",
    "print('-'*10, 'data summary', '-'*10)\n",
    "data_raw.describe().toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---------- randomely sample 1% data to view ----------\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>71</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Jenkin, Mr. Stephen Curnow</td>\n",
       "      <td>male</td>\n",
       "      <td>32.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>C.A. 33111</td>\n",
       "      <td>10.5000</td>\n",
       "      <td>None</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>80</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Dowdell, Miss. Elizabeth</td>\n",
       "      <td>female</td>\n",
       "      <td>30.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>364516</td>\n",
       "      <td>12.4750</td>\n",
       "      <td>None</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>133</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Robins, Mrs. Alexander A (Grace Charity Laury)</td>\n",
       "      <td>female</td>\n",
       "      <td>47.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5. 3337</td>\n",
       "      <td>14.5000</td>\n",
       "      <td>None</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>138</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Futrelle, Mr. Jacques Heath</td>\n",
       "      <td>male</td>\n",
       "      <td>37.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>113803</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>C123</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>193</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Andersen-Jensen, Miss. Carla Christine Nielsine</td>\n",
       "      <td>female</td>\n",
       "      <td>19.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>350046</td>\n",
       "      <td>7.8542</td>\n",
       "      <td>None</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>459</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>Toomey, Miss. Ellen</td>\n",
       "      <td>female</td>\n",
       "      <td>50.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>F.C.C. 13531</td>\n",
       "      <td>10.5000</td>\n",
       "      <td>None</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>526</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Farrell, Mr. James</td>\n",
       "      <td>male</td>\n",
       "      <td>40.5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>367232</td>\n",
       "      <td>7.7500</td>\n",
       "      <td>None</td>\n",
       "      <td>Q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>774</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Elias, Mr. Dibo</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2674</td>\n",
       "      <td>7.2250</td>\n",
       "      <td>None</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>777</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Tobin, Mr. Roger</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>383121</td>\n",
       "      <td>7.7500</td>\n",
       "      <td>F38</td>\n",
       "      <td>Q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>784</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Johnston, Mr. Andrew G</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>W./C. 6607</td>\n",
       "      <td>23.4500</td>\n",
       "      <td>None</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass  \\\n",
       "0           71         0       2   \n",
       "1           80         1       3   \n",
       "2          133         0       3   \n",
       "3          138         0       1   \n",
       "4          193         1       3   \n",
       "5          459         1       2   \n",
       "6          526         0       3   \n",
       "7          774         0       3   \n",
       "8          777         0       3   \n",
       "9          784         0       3   \n",
       "\n",
       "                                              Name     Sex   Age  SibSp  \\\n",
       "0                       Jenkin, Mr. Stephen Curnow    male  32.0      0   \n",
       "1                         Dowdell, Miss. Elizabeth  female  30.0      0   \n",
       "2   Robins, Mrs. Alexander A (Grace Charity Laury)  female  47.0      1   \n",
       "3                      Futrelle, Mr. Jacques Heath    male  37.0      1   \n",
       "4  Andersen-Jensen, Miss. Carla Christine Nielsine  female  19.0      1   \n",
       "5                              Toomey, Miss. Ellen  female  50.0      0   \n",
       "6                               Farrell, Mr. James    male  40.5      0   \n",
       "7                                  Elias, Mr. Dibo    male   NaN      0   \n",
       "8                                 Tobin, Mr. Roger    male   NaN      0   \n",
       "9                           Johnston, Mr. Andrew G    male   NaN      1   \n",
       "\n",
       "   Parch        Ticket     Fare Cabin Embarked  \n",
       "0      0    C.A. 33111  10.5000  None        S  \n",
       "1      0        364516  12.4750  None        S  \n",
       "2      0     A/5. 3337  14.5000  None        S  \n",
       "3      0        113803  53.1000  C123        S  \n",
       "4      0        350046   7.8542  None        S  \n",
       "5      0  F.C.C. 13531  10.5000  None        S  \n",
       "6      0        367232   7.7500  None        Q  \n",
       "7      0          2674   7.2250  None        C  \n",
       "8      0        383121   7.7500   F38        Q  \n",
       "9      2    W./C. 6607  23.4500  None        S  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view a small subset of the data\n",
    "print('-'*10, 'randomely sample 1% data to view', '-'*10)\n",
    "data_raw.randomSplit([0.01, 0.99])[0].toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.21 The 4 C's of data clearning: Correcting, Completing, Creating, and Converting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------------\n",
      "0: is not NULL\n",
      "1: is NULL\n",
      "-------------------------\n",
      "                         \n",
      "+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+\n",
      "|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|\n",
      "+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+\n",
      "|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|\n",
      "|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|\n",
      "|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|\n",
      "|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|\n",
      "|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|\n",
      "|          0|       0|     0|   0|  0|  1|    0|    0|     0|   0|    1|       0|\n",
      "|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|\n",
      "|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|\n",
      "|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|\n",
      "|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|\n",
      "+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# we first check which values are NULL values for each column\n",
    "# then we convert the boolean values to int (0 and 1), then we can count how many 1's exist in each column.\n",
    "print('-'*25)\n",
    "print('0: is not NULL')\n",
    "print('1: is NULL')\n",
    "print('-'*25)\n",
    "print(' '*25)\n",
    "# we build column strings and then use eval() to convert strings to column expressions.\n",
    "data_raw.select([eval('data_raw.' + x + '.isNull().cast(\"int\").alias(\"' + x + '\")') for x in data_raw.columns]).show(n=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train columns with null values:\n",
      "-------------------------\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sum(PassengerId)</th>\n",
       "      <th>sum(Survived)</th>\n",
       "      <th>sum(Pclass)</th>\n",
       "      <th>sum(Name)</th>\n",
       "      <th>sum(Sex)</th>\n",
       "      <th>sum(Age)</th>\n",
       "      <th>sum(SibSp)</th>\n",
       "      <th>sum(Parch)</th>\n",
       "      <th>sum(Ticket)</th>\n",
       "      <th>sum(Fare)</th>\n",
       "      <th>sum(Cabin)</th>\n",
       "      <th>sum(Embarked)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>177</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>687</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sum(PassengerId)  sum(Survived)  sum(Pclass)  sum(Name)  sum(Sex)  \\\n",
       "0                 0              0            0          0         0   \n",
       "\n",
       "   sum(Age)  sum(SibSp)  sum(Parch)  sum(Ticket)  sum(Fare)  sum(Cabin)  \\\n",
       "0       177           0           0            0          0         687   \n",
       "\n",
       "   sum(Embarked)  \n",
       "0              2  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print('Train columns with null values:')\n",
    "print('-'*25)\n",
    "data_raw.select([eval('data_raw.' + x + '.isNull().cast(\"int\").alias(\"' + x + '\")') for x in data_raw.columns]).\\\n",
    "    groupBy().sum().toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test columns with null values:\n",
      "-------------------------\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sum(PassengerId)</th>\n",
       "      <th>sum(Pclass)</th>\n",
       "      <th>sum(Name)</th>\n",
       "      <th>sum(Sex)</th>\n",
       "      <th>sum(Age)</th>\n",
       "      <th>sum(SibSp)</th>\n",
       "      <th>sum(Parch)</th>\n",
       "      <th>sum(Ticket)</th>\n",
       "      <th>sum(Fare)</th>\n",
       "      <th>sum(Cabin)</th>\n",
       "      <th>sum(Embarked)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>86</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>327</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sum(PassengerId)  sum(Pclass)  sum(Name)  sum(Sex)  sum(Age)  sum(SibSp)  \\\n",
       "0                 0            0          0         0        86           0   \n",
       "\n",
       "   sum(Parch)  sum(Ticket)  sum(Fare)  sum(Cabin)  sum(Embarked)  \n",
       "0           0            0          1         327              0  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print('Test columns with null values:')\n",
    "print('-'*25)\n",
    "data_val.select([eval('data_val.' + x + '.isNull().cast(\"int\").alias(\"' + x + '\")') for x in data_val.columns]).\\\n",
    "    groupBy().sum().toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.22 Clean data\n",
    "\n",
    "#### COMPLETE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# COMPLETE: complete or delete missing values in train and test/validation dataset.\n",
    "\n",
    "# complete missing age with median\n",
    "\n",
    "# complete missing embarked with mode\n",
    "\n",
    "# complete missing fare with median"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DataFrame[Age: double]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_raw.select('Age')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
